Introduction
General introduction of the raw dataset:
Data on squirrels in Central Park, New York City, were collected over
a 14-day period from October 6 to October 20, 2018. The census recorded
characteristics of each squirrel, such as age and fur color, along with
its activities, sounds, and location.
Motivation and initial question:
Squirrels are found everywhere, and some places clearly have more
squirrels than others. But is there any trend in where they stay with
respect to their colors, ages, activities, or other features? An
analysis of the squirrel census data may answer this question.
Main final goal:
The main goals of the project are to make maps from the census data
and to build functions that predict the locations of particular
squirrels from their characteristics, so that people can use our website
to look for the kinds of squirrels they like.
Inspirations:
The map-making process shown in class caught our eye since it is a
clear and direct way to convey information. A website called The
Squirrel Census (https://www.thesquirrelcensus.com/about) also did
research on the squirrels, but it doesn’t provide any predictions, so we
aim to develop a prediction system. Also, in order to attract more
people to the website, some interactive plots will be made to fully
introduce the raw dataset. The census provides only the data for Central
Park in 2018, so other datasets, such as data for Central Park in other
years or 2018 data for other places, will also be collected to compare
any changes in location.
Method
Source:
The original raw data we analyzed is from NYC Open Data. It includes
3,023 observations in total, each recording some characteristics of a
squirrel and the corresponding location.
Preliminary work:
Because the original raw dataset only includes the 2018 data for
Central Park, two other datasets were found as supporting material to
compare with the raw data. One dataset contains information about
squirrels not only in Central Park but across the whole of New York
City. The other covers the characteristics and behaviors of different
animals, squirrels included.
Data cleaning:
For data tidying and cleaning, the categorical variables were
transformed to numeric ones for analysis and model building. We didn’t
discard the missing values or unknowns directly but recoded them as 0,
since there are a lot of them and omitting them might reduce the
validity of the prediction. The dates were also cleaned.
We encoded squirrels’ activities in the “activity” column: “running”
was recoded to 1, “eating” to 2, “foraging” to 3, “climbing” to 4, and
“chasing” to 5.
We encoded squirrels’ interactions with humans in the “reaction”
column: “indifferent” was recoded to 1, “runs_from” to 2, and
“approaches” to 3.
We encoded squirrels’ sounds in the “sounds” column: “kuks” was
recoded to 1, “quaas” to 2, and “moans” to 3.
We encoded squirrels’ primary fur color in the “primary_fur_color”
column: “Gray” was recoded to 1, “Cinnamon” to 2, and “Black” to 3.
We encoded whether the sighting session occurred in the morning or
late afternoon in the “shift” column: “AM” was recoded to 1 and “PM” to
2.
We encoded squirrels’ age groups in the “age” column: “Adult” was
recoded to 1 and “Juvenile” to 2.
Finally, we kept the “unique_squirrel_id”, “hectare”, “shift”, “date”,
“hectare_squirrel_number”, “age”, “primary_fur_color”,
“highlight_fur_color”, “combination_of_primary_and_highlight_color”,
“location”, “lat_long”, “long”, “lat”, “activity”, “reaction”, and
“sounds” columns in the tidied dataset for further data analysis and
model building.
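The recoding described in this section can be sketched in a few lines
of Python. This is only an illustration under stated assumptions, not
the code used for the project: the function name encode_row and the
example row (including its ID) are hypothetical, and any value that does
not appear in a code table, including a missing one, is mapped to 0 as
described above.

```python
# Code tables copied from the data cleaning description.
# 0 is reserved for missing or unknown values.
CODES = {
    "activity": {"running": 1, "eating": 2, "foraging": 3,
                 "climbing": 4, "chasing": 5},
    "reaction": {"indifferent": 1, "runs_from": 2, "approaches": 3},
    "sounds": {"kuks": 1, "quaas": 2, "moans": 3},
    "primary_fur_color": {"Gray": 1, "Cinnamon": 2, "Black": 3},
    "shift": {"AM": 1, "PM": 2},
    "age": {"Adult": 1, "Juvenile": 2},
}


def encode_row(row):
    """Return a copy of row with the categorical columns replaced by codes.

    Values not found in a code table (missing, unknown, or an absent
    column) are recoded to 0 instead of being dropped.
    """
    encoded = dict(row)
    for column, table in CODES.items():
        encoded[column] = table.get(row.get(column), 0)
    return encoded


# Hypothetical observation, made up for illustration only.
example = {
    "unique_squirrel_id": "hypothetical-id",
    "shift": "PM",
    "age": "?",                # unknown age, becomes 0
    "primary_fur_color": "Gray",
    "activity": "foraging",
    "reaction": None,          # missing reaction, becomes 0
    "sounds": "kuks",
}
encoded_example = encode_row(example)
print(encoded_example)
```

Keeping the code tables in one dictionary makes it easy to check them
against the text and to reuse the same mapping when a user enters the
characteristics of a squirrel for prediction.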
Model building process:
Since a location has two components, longitude and latitude, we
planned to fit two linear models, one with longitude and one with
latitude as the outcome, and to combine the two predictions at the end.
We built several candidate models for each outcome using different
selection methods: p-value, step-wise (backward and forward at the same
time), criterion-based, and LASSO. The following explanation covers the
longitudinal model only; the latitudinal model follows exactly the same
procedure.
The first step was to put all the numerical variables into the model
and check their p-values. The variables are shift + age +
primary_fur_color + location + activity + reaction + sounds. Although
hectare is also a numerical variable, it was not included because users
of the model would not know how many squirrels there are within a
specific hectare; they only know the characteristics of the squirrels
they want to look for. Variables with a p-value greater than 0.05 were
removed from the model, and the model built with the remaining variables
was checked again to make sure that all of them had p-values less than
0.05. This produced the first model candidate, with predictors ‘shift’,
‘age’, ‘activity’, ‘reaction’, and ‘sounds’.
Then we selected a model using an automatic procedure, specifically
step-wise regression. Backward, forward, and step-wise methods might
produce different results, but we chose step-wise since it gives a
single ‘best’ model. As a result, all variables except location were
included, and this 6-predictor model is the second model candidate.
Next, we used a criterion-based procedure. The model with the largest
adjusted R-squared value and the smallest AIC and BIC values was chosen
as the model candidate. It turned out to contain the same 6 variables as
the model from the automatic procedure.
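The three criteria named in the criterion-based step can be computed
from observed and fitted values as sketched below. This is a minimal
illustration, not the project code: the function name
selection_criteria and the toy numbers are made up, and AIC and BIC use
the common Gaussian linear model form up to an additive constant, with
the parameter count taken as the number of predictors plus the
intercept. Statistical packages may count parameters differently, which
shifts the values but not the ranking of models fitted to the same data.

```python
import math


def selection_criteria(y, fitted, n_predictors):
    """Return adjusted R-squared, AIC and BIC for a fitted linear model."""
    n = len(y)
    mean_y = sum(y) / n
    rss = sum((obs - fit) ** 2 for obs, fit in zip(y, fitted))
    tss = sum((obs - mean_y) ** 2 for obs in y)
    r2 = 1 - rss / tss
    # Adjusted R-squared penalizes each extra predictor.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    # Assumption: k counts the predictors plus the intercept.
    k = n_predictors + 1
    aic = n * math.log(rss / n) + 2 * k
    bic = n * math.log(rss / n) + k * math.log(n)
    return {"adj_r2": adj_r2, "aic": aic, "bic": bic}


# Toy values for illustration only, not squirrel census data.
toy_y = [1.0, 2.0, 3.0, 4.0, 5.0]
toy_fitted = [1.1, 1.9, 3.2, 3.8, 5.0]
toy_criteria = selection_criteria(toy_y, toy_fitted, n_predictors=1)
print(toy_criteria)
```

A candidate is preferred when its adjusted R-squared is larger and its
AIC and BIC are smaller than those of the other candidates.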
The LASSO model selection method was then used. After searching for
the best lambda value, the third model candidate kept all seven
predictors, which means no variable was dropped by the selection
procedure.
We now have three different candidates for the final ‘best’ model,
and they are all nested within each other. We chose the ‘best’ model
according to two criteria, adjusted R-squared value and RMSE. For the
longitudinal model, the final model has 6 predictors (shift + age
+ primary_fur_color + activity + reaction + sounds) since it has the
highest adjusted R-squared value and an RMSE distribution very similar
to those of the other models.
As for the latitudinal model, it has 5 predictors (sounds +
primary_fur_color + reaction + activity + shift), while the other model
candidates have 6 predictors. Since the RMSE values and adjusted
R-squared values are approximately the same among all models, the
principle of parsimony tells us to choose the most succinct model.
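The parsimony argument above can be written as a small rule: among the
candidates whose RMSE is close to the best RMSE, take the one with the
fewest predictors. The sketch below is only an illustration. The
function names, the tolerance, and the RMSE values attached to the two
candidates are made-up placeholders, not results from our models.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def most_parsimonious(candidates, tolerance):
    """Pick the candidate with the fewest predictors among those whose
    RMSE is within `tolerance` of the smallest RMSE."""
    best = min(c["rmse"] for c in candidates)
    close = [c for c in candidates if c["rmse"] - best <= tolerance]
    return min(close, key=lambda c: len(c["predictors"]))


# Predictor sets follow the text; the RMSE numbers are placeholders.
candidates = [
    {"predictors": ["sounds", "primary_fur_color", "reaction",
                    "activity", "shift"],
     "rmse": 0.00770},
    {"predictors": ["shift", "age", "primary_fur_color", "activity",
                    "reaction", "sounds"],
     "rmse": 0.00769},
]
chosen = most_parsimonious(candidates, tolerance=1e-4)
print(chosen["predictors"])
```

The tolerance decides what counts as approximately the same RMSE, so it
should be chosen with the spread of the RMSE distribution in mind.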
Statistical tests:
Results
Data summaries:
In addition to the Central Park squirrel data, we also found another
dataset containing squirrel data for the whole of NYC, so that we could
check whether squirrels in Central Park differ from those in other
places in New York.
The first graph we drew was ‘Number of Observations’ vs. ‘Time of
Day’. Morning and afternoon data were separated, and we found that
squirrels tended to be more active in the afternoon and evening.
However, a limitation of the data is that we could not get the exact
time of their activities, only whether a sighting was in the morning or
the evening. We assume the evening sightings occurred before sunset,
since squirrels should be busy collecting food while there is sunlight.
The second graph we drew was ‘Number of Observations’ vs. ‘Primary
Fur Color’. It shows that different numbers of observations were made on
different days, with no clear pattern. Squirrels were observed most
often on Oct. 7 and Oct. 13, and clearly less often in the last few
days. Overall, gray squirrels were the most numerous and black ones the
fewest. Cinnamon squirrels were also observed fairly frequently, along
with some whose color was not identified.
The third graph we drew was a pie chart showing the distribution of
squirrels by age. The majority (88.6%) of them were adults, while the
remaining 11.4% were juveniles. It is not clear how the observers
determined the age group, possibly by size. A limitation is that only
‘adult’ and ‘juvenile’ were recorded; the predictions might be more
valid if other stages like ‘baby’ or ‘old’ were provided.
The fourth graph we drew shows the distribution of adult squirrels
only by their primary fur color. The majority (83.6%) of the adult
squirrels were gray, 12.8% were cinnamon, and the remaining 3.62% were
black.
The fifth graph we drew shows the distribution of juvenile squirrels
only by their primary fur color. The distribution was similar to that of
the adults: 79.5% of the juvenile squirrels were gray, 18% were
cinnamon, and the remaining 2.5% were black.
The sixth graph we drew shows squirrels’ activities by primary fur
color. Regardless of fur color, they tended to forage the most
frequently and chase the least frequently, which makes sense because
squirrels need to store food for the cold months.
The last graph we drew for the Central Park dataset shows how the
distribution of activities differs by location. For example, climbing
happened above ground most of the time, while foraging, running, and
eating mostly happened on the ground plane. Chasing was split roughly
evenly between above ground and the ground plane.
The first graph we drew for the NYC data shows squirrels’ activities
by primary fur color. Restricted to the activities recorded in the
Central Park dataset, gray squirrels climb the most frequently and chase
the least frequently. Black and cinnamon squirrels forage the most and
chase the least.
The second graph for the NYC data shows how the distribution of
activities differs by location. Eating, foraging, and running mostly
happened on the ground plane, while chasing and climbing happened above
ground most of the time.
The interactive map shows the location of each squirrel, colored by
primary fur color. Users can zoom in or out to find clusters of
squirrels.
Main model:
Predictions of longitude and latitude use two separate functions
since they have different predictors.
The longitudinal function has 6 predictors (shift + age +
primary_fur_color + activity + reaction + sounds) since it has the
highest adjusted R-squared value and an RMSE distribution very similar
to those of the other functions. The formula of the longitudinal
function is:
longitudinal = -0.7397 + 0.0005588*shift - 0.0008995*age -
0.0005278*primary_fur_color - 0.0002236*activity +
0.0005043*reaction + 0.002109*sounds
The latitudinal function has 5 predictors (sounds + primary_fur_color
+ reaction + activity + shift). Since the RMSE values and adjusted
R-squared values are approximately the same among all functions, the
principle of parsimony tells us to choose the most succinct model. The
formula of the latitudinal function is:
latitudinal = 40.7810047 + 0.0006262*shift -
0.0012940*primary_fur_color - 0.0002411*activity +
0.0007625*reaction + 0.0029603*sounds
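The two functions above can be evaluated for a particular squirrel as
sketched below. This is a minimal illustration rather than the code
behind our website: the coefficients are copied as printed above, the
names linear_prediction and predict_location are hypothetical, both
functions are assumed to use the reaction column, and the inputs are
assumed to use the numeric codes from the data cleaning section (0 for
unknown).

```python
# Coefficients copied as printed in the formulas above.
LONGITUDE_COEFS = {
    "intercept": -0.7397,
    "shift": 0.0005588,
    "age": -0.0008995,
    "primary_fur_color": -0.0005278,
    "activity": -0.0002236,
    "reaction": 0.0005043,
    "sounds": 0.002109,
}
LATITUDE_COEFS = {
    "intercept": 40.7810047,
    "shift": 0.0006262,
    "primary_fur_color": -0.0012940,
    "activity": -0.0002411,
    "reaction": 0.0007625,
    "sounds": 0.0029603,
}


def linear_prediction(coefs, features):
    """Intercept plus the sum of coefficient * coded feature value."""
    return coefs["intercept"] + sum(
        weight * features[name]
        for name, weight in coefs.items()
        if name != "intercept"
    )


def predict_location(features):
    """Return (longitudinal, latitudinal) predictions for one squirrel."""
    return (
        linear_prediction(LONGITUDE_COEFS, features),
        linear_prediction(LATITUDE_COEFS, features),
    )


# Example: an adult gray squirrel seen foraging in the PM shift,
# indifferent to humans, with no recorded sound (coded 0).
query = {"shift": 2, "age": 1, "primary_fur_color": 1,
         "activity": 3, "reaction": 1, "sounds": 0}
longitudinal, latitudinal = predict_location(query)
print(longitudinal, latitudinal)
```

Because the latitudinal function has no age term, the age code in the
query is simply ignored when the latitudinal value is computed.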
Comparisons:
We compared the squirrel data for Central Park alone with the data
for the whole of NYC.
We found that squirrels in Central Park, regardless of color, forage
the most and chase the least. However, squirrels of different colors
behaved differently across NYC: gray squirrels climb the most, while
black and cinnamon squirrels forage the most frequently.
As for activities in different locations, Central Park and the whole
of NYC follow basically the same trend: eating, foraging, and running
were most likely on the ground plane, and other activities happened
above ground. Chasing, however, was a little different. In Central Park,
chasing on the ground plane was slightly more likely than above ground,
but for the whole of NYC about half of the chasing was above ground.